Abstract

Short tandem repeat (STR) variation has been proposed as a major explanatory factor in the heritability of complex traits in humans and model organisms. However, we still struggle to incorporate STR variation into genotype-phenotype maps. Here, we review the promise of STRs in contributing to complex trait heritability, and highlight the challenges that STRs pose due to their repetitive nature. We argue that STR variants are more likely than single nucleotide variants to have epistatic interactions, reiterate the need for targeted assays to accurately genotype STRs, and call for more appropriate statistical methods in detecting STR-phenotype associations. Lastly, somatic STR variation within individuals may serve as a read-out of disease susceptibility, and is thus potentially a valuable covariate for future association studies.

The 'missing heritability' of complex diseases and STR variation.

Complex diseases such as diabetes, various cancers, cardiovascular disease, and neurological disorders cluster in families, and are thus considered to have a genetic component [1-3] (Glossary). The identification of these genetic factors has proven challenging; although genome-wide association (GWA) studies have identified many genetic variants that are associated with complex diseases, these generally confer less disease risk than expected from empirical estimates of heritability. This discrepancy, termed the 'missing heritability', has been attributed to many factors [1-6]. A trivial explanation is that shared environments among relatives may artificially inflate estimates of heritability. However, missing heritability may also be due to variants in the human genome that are currently inaccessible at a population scale [1,2]. One such class of variation is short tandem repeat (STR) unit number variation. Some have previously suggested that adding STR variation to existing genetic models would considerably increase the proportion of heritability explained by genetic factors in human disease [7,8]. Three percent of the human genome consists of STRs [9] and 6% of human coding regions are estimated to contain STR variation [10,11]. Recently, the first catalog of genome-wide population-scale human STR variation has appeared [12], opening up new possibilities for understanding the contribution of STRs to human genetic diseases. This catalog, and similar data sources [13], have appeared decades after initial calls for the assessment of the role of STRs in phenotypic variation [14], lagging behind surveys of other genomic elements. Much of the initial interest in STRs was generated by the discovery of phenomena such as genetic anticipation, which are mediated by the unique features of STRs [15]. As we will discuss, new and forthcoming data sources will help to realize the long-deferred promise of STRs for explaining heritability.

STRs consist of short (2-10 bp) DNA sequences (units) that are repeated head-to-tail multiple times. This structure causes frequent errors in recombination and replication that add or subtract units, leading to STR mutation rates that are 10-fold to 10^4-fold higher than those of non-repetitive loci [16,17]. Due to technical barriers, STR variation has until very recently remained inaccessible to genome-wide assessment. STRs are often conserved (even if their unit number or even sequence changes), especially in coding sequences [18-21]. In both humans and the yeast Saccharomyces cerevisiae, promoter regions are known to be dramatically enriched for STRs [22,23]. In coding regions, STRs tend to occur in genes with roles in transcriptional regulation, DNA binding, protein-protein binding, and developmental processes [16,21,22]. These consistent functional enrichments across vastly diverged lineages suggest important functional roles for STRs.

Indeed, analysis of STR variation in the Drosophila Genetic Reference Panel identified dozens of associations between STR variants and quantitative phenotypes in recombinant inbred fly lines [13]. Moreover, accumulating evidence from exhaustive genetic studies shows that STR variation has dramatic, often background-dependent phenotypic effects in model organisms [25-29]. Together, these findings suggest that STR variation has the potential to dramatically revise the heritability estimates attributable to genetic factors.

The high STR mutation rate also leads to substantial somatic variation of STR loci within individuals. In fact, this somatic variation, also called microsatellite instability (MSI), has been used for decades as a biomarker for different classes of cancer [30]. Recent studies demonstrate that organisms exposed to various environmental stresses and perturbations show increased genome instability, including MSI [31-34]. MSI may be useful as a biomarker for cellular stress states that may predispose to disease.

The broad interest in STR variation has led to the development of techniques for high-throughput genotyping of STRs [35,36] and an explosion of analysis tools for extracting STR variation from existing sequence data [37-39]. However, the precision of these methods remains limited, due to a combination of low effective coverage of STRs and the lack of robust models for distinguishing technical error from somatic variation. Attempts to use STR variation for GWA in the same way as single nucleotide variants (SNVs) may be underpowered and confounded by the unique characteristics of this class of variants. In this review, we discuss the latest advances in these fields, and lay out a set of priorities for the future study of STRs.

STR variation is associated with human genetic diseases

Within coding regions, STR mutations are generally in-frame additions and subtractions of repeat units, resulting in proteins with variable, low-complexity amino acid runs [21]. These mutations can result in phenotypic effects and lead to genetic disorders; several neurological diseases (spinocerebellar ataxias, Huntington's disease, spinobulbar muscular atrophy, dentatorubral-pallidoluysian atrophy, intellectual disability, etc.) are a consequence of dramatically expanded STR alleles [7,40,41]. Many of these disease-associated STR expansions behave as dominant gain-of-function mutations [7]. However, even comparatively modest coding STR variation may confer disease risk or behavioral phenotypes, according to a variety of single-marker association studies [42-45]. For instance, variants in separate coding STRs in RUNX2 are associated with defects in bone mineralization and a higher incidence of fractures [46,47], and STR variation in this gene in dogs is also associated with craniofacial phenotypes [48]. Noncoding STR variation in regulatory sequences can affect transcription, RNA stability, and chromatin organization. For instance, certain STR variants alter CFTR expression and thus cystic fibrosis status [16]. We take these studies as evidence that STR variation, even in the absence of large expansions, may contribute significantly to the heritability of human traits and genetic diseases.

The severity of the STR expansion-associated diseases may suggest that natural selection should eliminate STRs in functional regions, but several recent studies across many organisms indicate that variable STRs are globally maintained [19,20,24,49,50]. For example, the pre-expansion polyQ-encoding STR in the human gene SCA2 is under positive selection, suggesting that this variable STR is actively maintained in spite of the pathogenic expansions that do occasionally occur and cause spinocerebellar ataxia [51]. Considering both the evidence of positive selection on STRs and the functional enrichments of STR-containing genes, several authors have proposed that functional STRs are maintained because they confer 'evolvability', or the capacity for fast adaptation [21,22,52-54]. This suggestion is intriguing, in part because many STR mutations are dominant, and, when beneficial, can quickly sweep to fixation. Although we do not further discuss these evolutionary considerations here, they underscore the phenotypic potential of STR variation.

STR variation has dramatic background-dependent effects on phenotype

To date, the functional consequences of unit number variation in selected STRs have been studied in plants, fungi, flies, voles, dogs, and fish [25,27,28,55-57], among other organisms. In Saccharomyces cerevisiae, STR unit number in the FLO1 gene accurately predicts the phenotype of cell-cell and cell-substrate adhesion (flocculation); flocculation provides protection against various stresses [57,58]. STR variation in yeast promoters has been shown to alter gene expression [22]. In Drosophila melanogaster, Neurospora crassa, and Arabidopsis thaliana, natural coding STR variation in circadian clock genes alters diurnal rhythmicity and developmental timing [25-27,59]. Some have proposed that the large phenotypic responses to selection observed in the Canidae are a consequence of elevated STR mutation rates relative to other mammalian clades [48,53]. We can state unambiguously that naturally variable STRs underlie dramatic phenotypic variation in model organisms.

Beyond the observable fact that variable STRs affect phenotype, we can make specific predictions about the components of phenotypic variation that they affect. Both theoretical expectations and empirical data indicate that STR variants are likely to participate in epistatic interactions, and probably more so than most SNVs. One plausible hypothesis is that STRs act as mutational modifiers of other loci, as may be expected intuitively from their elevated mutation rate (Box 1, Figure I).

This expectation is borne out in the handful of studies reporting exhaustive genetic analysis of STRs. For instance, in the Xiphophorus genus of fish, a genetic incompatibility has recently been attributed to the interaction between the xmrk oncogene and an STR in the promoter of the tumor suppressor cdkn2a/b [29,60]. If the xmrk gene product is not properly regulated by cdkn2a/b, fish develop fatal melanomas, a two-locus Bateson-Dobzhansky-Muller incompatibility described in classic genetic experiments (Figure 1A) [61-63]. Expansions in the cdkn2a/b promoter STR are associated with the presence of a functional copy of the xmrk oncogene across species, and are thought to functionally repress the activity of the xmrk gene product through increased dosage of the tumor suppressor [29].

Similarly, we have shown that natural variation in the polyQ-encoding ELF3 STR significantly affects all ELF3-dependent phenotypes in the plant A. thaliana, with ELF3 STR length and phenotype showing a strikingly nonlinear relationship (Figure 1B) [25]. Some naturally occurring ELF3 STR variants phenocopy elf3 loss-of-function mutants in a common reference background (Figure 1B), suggesting background-specific modifiers. Indeed, when we compare the phenotypic effects of each ELF3 STR variant between two divergent backgrounds, Columbia (Col-0) and Wassilewskija (Ws), we find dramatic differences. The endogenous STR alleles from these two strains (Col-0 7 units, Ws 16 units) show mutual incompatibility when exchanged between backgrounds. The ELF3 protein is thought to function as an "adaptor protein" or physical bridge in diverse protein complexes [64,65]. We speculated that background-specific polymorphisms in these interacting proteins underlie the ELF3 STR-dependent background effect.

Also in A. thaliana, a variable STR in the promoter of the CONSTANS gene has been linked to phenotypic variation in the onset of flowering [28]. CONSTANS encodes a major regulatory protein that promotes flowering. Transgenic experiments demonstrate that this regulatory STR variation affects CONSTANS expression and hence onset of flowering. However, the effects of this STR variation depend on the presence of a functional allele of FRIGIDA, a negative regulator of flowering that is highly polymorphic across A. thaliana populations.

A dramatic example of incompatibility can be found in an intronic repeat in the IIL1 gene in A. thaliana, which was found to be dramatically expanded in one strain [55]. The expansion delayed flowering under high temperatures, but when the expansion was crossed into the reference genetic background, a strongly interacting locus modified this phenotype.

In the Drosophila genus, coding STR variation in the per gene co-evolves with other variants [59,66]. Transgenic flies expressing chimeric per genes with a D. melanogaster STR domain fused to a D. pseudoobscura flanking region (and vice versa) have arrhythmic circadian clocks, indicating the modifying effect of flanking variation in generating an STR-based genetic incompatibility. Among STRs subjected to exhaustive genetic study, to our knowledge, only the yeast FLO1 coding STR has no known modifiers due to variation in genetic background [57].

In addition to these exhaustive genetic studies, there are several other observations that support the role of the genetic background in controlling the phenotypic effects of STRs. For instance, experiments in Caenorhabditis elegans and human cells indicate that the phenotypic effects of proteins with expanded polyQ tracts are modulated by genetic background [67], or by variants in interacting proteins [68]. In humans, genetic association studies indicate the existence of genetic modifiers of polyQ expansion disorders for both Huntington's disease [69] and spinocerebellar ataxias [70]. Taken together, these experimental and observational data support our argument that functional STRs are likely to be enriched for variants in epistasis with other loci.

STRs with background-dependent phenotypic effects tend to either encode polyQ tracts or reside in promoter regions. There are good reasons to expect that these STR classes might be enriched in DNA-protein or protein-protein interactions that could underlie epistasis. PolyQ tracts, specifically, often bind DNA surfaces [71], and an analysis of human protein interactome data found that polyQ-containing proteins engage in more physical interactions with other proteins than those without polyQs [72]. Similarly, noncoding STRs in regulatory regions may compensate for mutations in trans-acting factors, as observed for the STRs in the cdkn2a/b promoter in Xiphophorus [29] and in the CONSTANS promoter in A. thaliana [28]. We suggest that polymorphisms in protein interaction partners or in transcriptional regulators are plausible explanations for the observed background effects. In summary, we expect that STR variation is likely to contribute a substantial epistatic component to heritability, which has important implications for the use of STRs in explaining phenotypic variation.

Analytical tools and genotyping methods continue to struggle with STR-specific challenges.

To fulfill the promise of STR variation for explaining heritability, we need accurate, genome-wide assessment of STR variation in populations of humans and other organisms. The scientific community has tackled this problem in a flurry of recent studies describing methods for genotyping STRs genome-wide (Table 1). Specifically, in the last two years, several analytical tools have been developed to call STR genotypes from whole-genome-sequencing data [37-39]. These tools attempt to address the two major challenges for genotyping STRs: poor mappability due to low sequence complexity and high technical error rate due to amplification stutter.

To accurately map an STR sequence read and retrieve its unit number genotype, the sequence read must span the STR of interest and include some unique flanking sequence. This requirement limits the length of STRs that can be accurately genotyped and decreases effective STR coverage compared to average whole-genome-sequencing coverage (Figure 2). For this reason, much of the existing sequencing data, which consists largely of short reads (36 bp, 50 bp, or 76 bp) with only modest genome coverage (5-20X), is not suitable for accurate, genome-wide calls of STR genotypes; only a fraction of STRs, mostly short ones, can be assessed with some confidence (Figure 2).
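The spanning requirement can be made concrete with a back-of-the-envelope calculation. The sketch below is a simplification and not the procedure used by any of the cited tools: it assumes that reads start uniformly at random along the genome, that a read is informative only if it covers the whole STR plus a fixed number of unique flanking bases on each side (the `flank` parameter, set to an arbitrary 5 bp), and that mapping and sequencing errors can be ignored. All function and parameter names are illustrative.

```python
def spanning_coverage(read_len, str_len, coverage, flank=5):
    """Expected number of reads that fully span an STR.

    Assumes uniformly distributed read starts; `flank` is an assumed
    number of unique bases required on each side of the repeat.
    """
    # Number of read start positions from which a read covers the STR
    # plus `flank` bases on both sides.
    usable_starts = read_len - (str_len + 2 * flank) + 1
    if usable_starts <= 0:
        return 0.0  # the read is too short to span this STR at all
    # Reads start at an average rate of coverage / read_len per base.
    return coverage * usable_starts / read_len


if __name__ == "__main__":
    for read_len in (36, 50, 76, 101):
        for str_len in (20, 30, 60):
            depth = spanning_coverage(read_len, str_len, coverage=10)
            print(f"read {read_len:>3} bp, STR {str_len:>2} bp: "
                  f"{depth:.1f} spanning reads at 10x")
```

Under these assumptions a 36 bp read cannot span a 30 bp STR with its flanks at all, and the effective STR coverage is always lower than the nominal genome coverage.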

Moreover, these analytical tools estimate technical error based on STR genotypes from sequenced homozygous or haploid genomes, ignoring somatic alleles within individuals (which are expected for STRs even in primary tissues, occurring at rates 10^4 to 10^5 times higher than SNV somatic mutations) [73-76]. Probabilistic error models have been formulated to quantify variation arising from technical sources [37,38], but in the face of somatic STR variation, these models presumably require substantial read coverage to call germ-line STR genotypes with confidence. However, because of the low effective coverage of STR loci (Figure 2), STR genotype calls are based on as few as one to two STR-spanning reads [37,38] (Table 1). Calls based on so few reads may not be accurate even for homozygous germ-line alleles. Calling heterozygous STR genotypes remains difficult with the modest coverage of most available whole-genome-sequencing data, such as that of the 1000 Genomes Project [12], and becomes even more challenging when potential somatic mutations contribute to a heterogeneous sample population. To illustrate this challenge, consider a heterozygous ~30 bp STR locus and whole-genome sequencing with 101 bp reads at 5x coverage; this scenario is likely to yield just three STR-spanning reads (Figure 2). These three reads may represent one, two, or three different alleles, representing any mixture of two different germ-line alleles, somatic alleles, or technical error, making an accurate call difficult. Consequently, an increase in the sequencing depth of available data may be required before these tools reach their full potential.
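To see why a handful of spanning reads is too few for a heterozygous call, the following sketch computes the chance that n reads sample both germ-line alleles at least once. It assumes that each read is drawn independently from either allele with equal probability and that reads carry no technical error or somatic variation, so the result is an optimistic upper bound rather than a realistic error model. The function name is illustrative.

```python
def prob_both_alleles_seen(n_reads, allele_fraction=0.5):
    """Probability that n_reads independent reads include both alleles
    of a heterozygous locus at least once.

    Assumes error-free reads and no somatic variants (an optimistic
    simplification); allele_fraction is the chance that a read comes
    from the first allele.
    """
    if n_reads < 2:
        return 0.0
    p, q = allele_fraction, 1.0 - allele_fraction
    # Complement of "all reads from allele 1" or "all reads from allele 2".
    return 1.0 - p ** n_reads - q ** n_reads


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 20):
        print(f"{n:>2} spanning reads: "
              f"P(both alleles observed) = {prob_both_alleles_seen(n):.3f}")
```

Even in this idealized setting, three reads miss one of the two alleles a quarter of the time, and an allele seen in a single read cannot be told apart from stutter error or a somatic variant.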

Others have attempted to genotype STRs using whole-genome-sequencing data from paired-end reads (50 bp) of size-selected genomic fragments [39], similar to strategies used to detect large insertions or deletions [77-80]. This approach is limited by the resolution of gel electrophoresis in the size selection of DNA fragments. Consequently, this method cannot determine STR unit number genotypes, but rather reports whether an STR is variable across samples. The authors argue that this approach is the most accurate for population-level detection of STR variability [81], but it is not informative for discerning the relationship between STR unit number genotype and phenotype.

Although these analysis tools represent important and useful advances, their limitations illustrate that 'dustbin-diving' of whole-genome-sequencing data may not suffice for accurate population-scale genotyping of STRs genome-wide. Alternative approaches that enrich for STR-spanning sequencing reads are needed. Indeed, two such approaches have been recently published. Both use targeted capture of STRs to enrich for STR-spanning reads combined with high-throughput sequencing compatible with midsize reads (101 bp, 500 bp) [35,36]. Targeted STR capture requires the design of STR-specific probes (or rather probes specific to their unique flanking sequences) and involves additional sequencing, but these approaches can dramatically increase the number of informative reads, therefore providing substantial STR coverage for accurate genotyping calls (Table 1). For example, the SureSelect-RNA-probe capture method reports 27% informative STR-spanning reads compared to the 0.2% informative reads found in whole-genome-sequencing data (Table 1). This increase in informative reads is a major advantage over whole-genome resequencing because STRs represent only a small fraction of the genome overall [35,36]. Although targeted capture combined with high-throughput sequencing appears to be a cost-effective alternative for accurate STR genotyping compared to whole-genome sequencing, distinguishing heterozygous alleles, somatic variants, and technical error remains a challenge. We suggest that recent innovations in single-molecule targeted capture [82] should be useful in distinguishing these categories and in further increasing enrichment of informative, STR-spanning reads.

Lack of statistical models for detecting STR-phenotype associations in GWA.

Assuming that we obtain accurate, population-scale genotype data for STRs, we may not yet have statistical tools appropriate for detecting STR associations with phenotype [8]. In diploid organisms, a biallelic SNV is typically analyzed by modeling phenotype as a function of the number of non-reference alleles at that locus (0, 1, or 2) in each individual. A null hypothesis of no monotonic relationship between phenotype and the allele count is then formulated and tested [83]. This framework cannot accommodate more than two alleles, which we would expect for many STRs. Simply using tagged SNVs linked to STRs to perform GWA is unfeasible, because linkage disequilibrium decays very quickly between SNVs and STRs across human populations [12].
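The additive SNV encoding described above can be sketched as follows. This is a minimal illustration on invented toy data, not a full association test (no covariates and no significance testing): genotypes are encoded as non-reference allele counts, and the phenotype is regressed on that count by ordinary least squares. The last lines show why the encoding does not carry over directly to STRs, where a diploid genotype is a pair of unit numbers drawn from more than two alleles. All names and values are hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def snv_allele_count(genotype, ref="A"):
    """Encode a biallelic SNV genotype such as ('A', 'G') as 0, 1, or 2."""
    return sum(1 for allele in genotype if allele != ref)


# Toy data (invented for illustration): six diploid individuals.
snv_genotypes = [("A", "A"), ("A", "G"), ("G", "G"),
                 ("A", "A"), ("A", "G"), ("G", "G")]
phenotypes = [1.0, 2.0, 3.0, 1.2, 2.1, 2.9]

counts = [snv_allele_count(g) for g in snv_genotypes]
print("allele counts:", counts)
print("additive effect per allele:", round(ols_slope(counts, phenotypes), 3))

# An STR genotype is a pair of unit numbers; with alleles of 7, 9, 12, and
# 16 units there is no single 0/1/2 count that distinguishes all genotypes.
str_genotypes = [(7, 7), (7, 16), (9, 12), (16, 16), (7, 9), (12, 16)]
distinct_alleles = sorted({a for g in str_genotypes for a in g})
print("distinct STR alleles:", distinct_alleles)
```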

To address these complications, a previous study attempted GWA between STR genotypes and human disease phenotypes by comparing relative frequencies of various alleles in pooled DNA from cases and controls [84]. By pooling samples, this approach eases the analysis of multiallelic loci, but it loses information by ignoring specific individuals.

In a more recent study, the authors used logistic regression and the analysis of variance to detect associations between STR alleles and quantitative phenotypes in an inbred Drosophila mapping population [13]. Given that significant associations were detected, such approaches may be sufficiently powerful in recombinant inbred lines. However, their strategy relied on homozygosity, and considered multiallelic STRs in a pairwise fashion, so these straightforward methods will lose power with outbred populations and multiallelic STRs.

The central confounder of these studies is that most STRs of appreciable variability (and thus, interest) are multiallelic, as a simple consequence of the STR mutational mechanism [17]. This multiallelic feature could be accommodated by treating STR alleles categorically, but this choice entails a corresponding reduction in power, because many alleles are rare.

Some studies have reported linear associations between STR unit number and quantitative phenotypes [27,57], suggesting that using simple tests of linear correlation between these variables may be a powerful option. However, this linearity (or even monotonicity) of the relationship between STR unit number genotype and phenotype is a poorly-supported assumption [25]. Nonetheless, STR unit number is a numerical variable, and it would be preferable to gain power from treating it as such. For instance, more similar STR unit number genotypes might be associated with more similar phenotypes, but this intuition may be difficult to generalize.
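A small numerical example shows how a test of linear correlation can miss a strong but non-monotonic relationship. The data are synthetic (a symmetric U-shaped phenotype invented for illustration, not the ELF3 measurements from [25]), and the Pearson coefficient is implemented by hand to keep the sketch free of dependencies.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic STR unit numbers and two hypothetical phenotypes.
units = list(range(5, 16))                        # 5, 6, ..., 15 units
linear_pheno = [2.0 * u + 1.0 for u in units]     # strictly linear
u_shaped_pheno = [(u - 10) ** 2 for u in units]   # optimum at 10 units

print("linear phenotype:   r =", round(pearson_r(units, linear_pheno), 3))
print("U-shaped phenotype: r =", round(pearson_r(units, u_shaped_pheno), 3))
```

Both phenotypes are fully determined by unit number, yet only the linear one produces a nonzero correlation; the U-shaped one is invisible to this test.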

Lastly, both intuition (Box 1) and the studies discussed above lead us to expect that relatively many phenotypically relevant variable STRs will show epistasis with other loci. This epistasis will reduce power in tests of association between STRs and phenotype [85], given the inadequacy of the current paradigm of quantitative genetics in detecting and modeling the effects of epistasis [85,86]. At present, targeted and exhaustive genetic studies (as described above) are the only effective method for understanding the effects of epistasis.

In total, these obstacles present a daunting challenge for the integration of STR genotypes into the current genotype-phenotype maps. Overall, we call for a reappraisal of statistical methodologies for use in GWA with STR variation to account for these various STR-specific confounders.

Somatic STR variation may be a sensitive marker for increased disease susceptibility.

It has been appreciated for some time that the high STR mutation rate leads to somatic variation within individuals in addition to germ-line variation between individuals [71]. This somatic STR variation is particularly noticeable in tumor tissues, but is also measurable in primary tissues [73,87]. While these findings immediately led to systems of classification for tumor types and clones [76,88,89], the investigation of somatic STR variation (or MSI) may also inform us about general phenotypic states and disease susceptibility.

Patients with various complex diseases tend to carry a greater load of rare germ-line variants than unaffected control groups [6]. It is widely assumed that these rare variants contribute in some fashion to these disorders [90]; however, an alternative interpretation holds that they are signs of stochastic genome instability, which when increased leads to higher susceptibility to complex diseases [6]. Increased genome instability will increase somatic variation, which may then serve as a read-out of disease susceptibility [6]. This alternative interpretation has some support from empirical data. For instance, perturbation of the molecular chaperone Hsp90, which stabilizes diverse DNA repair proteins, leads to increased somatic STR mutation rates in human cells; in various model organisms Hsp90 perturbation increases transposon mobility and intrachromosomal homologous recombination [31-34]. Hsp90 perturbation also increases the penetrance of many genetic variants in flies, plants, fish, worms and yeast, suggesting that increased genome instability and increased phenotypic heritability are associated [34]. If this association also applies to disease phenotypes, increased genome instability may predict higher disease susceptibility.

Consequently, although somatic MSI may not be the cause of disease phenotypes, it may serve as a biomarker for individuals who are more vulnerable to environmental and genetic perturbations leading to disease. Again, this strategy hinges on the development of cost-effective technologies for screening panels of STRs for somatic mutations across many humans, which will require new strategies to distinguish technical error from somatic STR variation.

Another possibility is that somatic variation is itself phenotypically relevant, or even plays a role in developmental processes. It is known that STRs are enriched in genes with neuronal function [91]; some have even proposed that such somatic mutation is a component of normal neuronal development in humans [92]. If this is the case, then a greater appreciation of somatic variation will be necessary to understand canonical developmental processes. Collectively, STR variation within (in addition to between) individuals has great potential as a read-out for disease susceptibility, and perhaps also as a cause of phenotypic variation itself.

Concluding remarks

The study of STRs and other under-ascertained genomic elements has the potential to reshape our model of the heritability of complex diseases and traits, both in terms of the overall proportion of heritability explained, and in terms of the components of heritability themselves (Outstanding Questions). Experimental studies in model organisms have taught us that the phenotypic effects of genome-wide STR variation are both dramatic and impossible to understand without taking epistasis into account. In the future, our understanding will be improved by 1) accurate STR population-scale and somatic genotyping, 2) more appropriate statistical methods for analyzing STR-phenotype associations, and 3) a broader description of epistasis between STR variation and other loci in determining phenotype.

OUTSTANDING QUESTIONS

• In light of widespread epistasis, what statistical and experimental tools can quantify the effect of STR variation on phenotype?
• Can inexpensive, accurate tools be developed for germ-line and somatic STR genotyping?
• Will somatic STR variation be effective as a read-out for disease susceptibility?

GLOSSARY

15 Short tandem repeat (STR): a repetitive nucleotide sequence that consists of many 

16 copies of a short sequence in tandem (ex. CAGCAGCAGCAG). STRs are frequently 

17 called microsatellites. 

18 Single nucleotide variant (SNV): Variant that consists of a change at a single 

19 nucleotide position. Common SNVs are sometimes called single nucleotide 

20 polymorphisms (SNPs). 

21 Heritability: The fraction of variation in a phenotype across a population that can be 

22 attributed to genetic differences. 
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1 Epistasis: Non-reciprocal interactions of non-allelic gene variants, due for instance to 

2 functional interdependence between gene products in a protein complex or metabolic 

3 pathway. 

4 Genome-wide association (GWA): A set of methods by which each of a large number 

5 of genetic variants genome-wide is tested for statistical associations with a phenotype. 

6 Often referred to in the context of genome-wide association studies (GWAS). 

7 Complex disease, complex traits: Complex diseases or traits are phenotypic 

8 characters thought to be affected by multiple genetic and environmental factors. 

9 Somatic variation: Genetic variation across somatic cells or tissues of an organism, 

10 which are generally not inherited by offspring (which inherits instead germ-line 

11 variation). Generally arises from mutations in specific cell lineages after early 

12 development. 

13 Microsatellite instability (MSI): Somatic variation of STRs (microsatellites) associated 

14 with phenotypic changes such as cancer, often due to mutations in DNA repair genes. 

Bateson-Dobzhansky-Muller incompatibility: Hybrid incompatibilities observed when crossing two closely related species or divergent strains of a species. Caused by the co-segregation of non-parental allele combinations, resulting in a dysfunctional genetic interaction (negative epistasis).

Genetic anticipation: A mode of disease inheritance characterized by progressively earlier ages of disease onset in successive generations. Generally caused by the gradual expansion of STRs.

Table 1. Technologies for assessing STR variation by high-throughput sequencing.
a: Minimum coverage of a single STR that is considered sufficient to call a genotype.
b: Sequence data from Illumina HiSeq technology.
c: Data references: [93,94]
d: Data references: [95,96]
e: Data references: [97,98]

BOX 1: Modifier mutations leading to epistasis are expected in STRs.

We have previously proposed that STRs might be more susceptible to genetic interactions [25], as we briefly explain here. Consider a simple two-locus haploid model under panmixis, in which loci A and B each start with a single allele (a and b, giving the genotype ab) and have the same probability p per generation of mutating to a second allele (a* or b*), with p also being the probability per generation of a reverting mutation (Figure I). Let us further assume that A and B are in sign epistasis [99] (that is, a*b and/or ab* have lower fitness than ab and a*b*). To escape the unfavorable a*b genotype, a lineage may either revert to ab or mutate forward to a*b*. When the A and B loci have equal mutation rates, we expect the reversion of a single mutant to be just as likely as a second mutation, and consequently a*b* individuals will appear only rarely and slowly.

However, consider a similar model in which locus B has an elevated mutation rate p_b > p_a. In this case, the a*b genotype has a higher probability of acquiring a second, modifying mutation to a*b* than of reverting to ab. Moreover, flux along the other mutational path (ab -> ab* -> a*b*) will be increased. In sum, a*b* genotypes will arise at higher rates, and will attain their equilibrium frequency much more rapidly, if either A or B has an elevated mutation rate [100] (p. 131). This scenario can quickly lead to an equilibrium population in which incompatible epistatic alleles are frequent, even though recombinants have lower fitness. Relaxing the assumption of no population structure will further speed this process. Consequently, we would expect STRs and other loci with high mutation rates to be more likely than loci with lower mutation rates to modify other alleles, as long as we assume that all loci are equally capable of genetic interactions. This process may be referred to as 'coadaptation'. For a rigorous model of the evolution of hybrid incompatibility, see Orr [101].
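
The verbal argument in this box can be illustrated with a minimal deterministic sketch in Python. It is not the model of [100] or [101]; it assumes an infinite haploid population, no recombination and no drift, symmetric forward and reverse mutation at each locus, and a hypothetical fitness cost s paid only by the single mutants a*b and ab*. The function names, the cost s = 0.5, and the 0.25 frequency threshold are illustrative assumptions rather than values taken from the text.

```python
from itertools import product

# Genotypes are (A allele, B allele); 0 = ancestral (a or b), 1 = derived (a* or b*).
GENOTYPES = list(product((0, 1), repeat=2))


def step(freqs, p_a, p_b, s):
    """Advance genotype frequencies by one generation: mutation, then selection."""
    mutated = {g: 0.0 for g in GENOTYPES}
    for (i, j), f in freqs.items():
        for i2, j2 in GENOTYPES:
            # Each locus flips with its own rate, forward or backward (assumed symmetric).
            prob_a = p_a if i2 != i else 1.0 - p_a
            prob_b = p_b if j2 != j else 1.0 - p_b
            mutated[(i2, j2)] += f * prob_a * prob_b
    # Sign epistasis (assumed form): single mutants have fitness 1 - s, ab and a*b* have fitness 1.
    weighted = {g: f * ((1.0 - s) if g[0] != g[1] else 1.0) for g, f in mutated.items()}
    mean_fitness = sum(weighted.values())
    return {g: f / mean_fitness for g, f in weighted.items()}


def generations_to_threshold(p_a, p_b, s=0.5, threshold=0.25, max_generations=1_000_000):
    """Generations until a*b* reaches `threshold`, starting from a population fixed for ab."""
    freqs = {g: 0.0 for g in GENOTYPES}
    freqs[(0, 0)] = 1.0
    for generation in range(1, max_generations + 1):
        freqs = step(freqs, p_a, p_b, s)
        if freqs[(1, 1)] >= threshold:
            return generation
    return None  # threshold not reached within max_generations


if __name__ == "__main__":
    equal = generations_to_threshold(p_a=1e-3, p_b=1e-3)
    elevated = generations_to_threshold(p_a=1e-3, p_b=1e-2)
    print("equal rates:", equal, "generations")
    print("elevated rate at B:", elevated, "generations")
```

Under these assumptions, raising the mutation rate at locus B alone shortens the time for a*b* to become common, because both mutational paths through the low-fitness intermediates carry more flux. The mutation rates used here are far higher than realistic per-locus rates and were chosen only to keep the run short.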
Figure I. A locus with a higher mutation rate allows genetic modification of unfavorable genotypes at interacting loci. Top, a model of evolution under epistasis with a single, slow mutation rate. Middle, a model of evolution under epistasis with one slow and one fast mutation rate. Boxes represent loci, stars represent SNV-type mutations, and black-and-white checkering indicates an STR locus (a/b, a*/b, and a*/b* signify different genotypes). Arrows with numbers represent possible mutations and their respective rates. Bottom, fitness of each genotype under both models. We expect that the model with two mutation rates will occupy the fully derived state (a*/b*) more quickly.
Figure 1. Genetic and transgenic analysis reveals STR-mediated incompatibilities. A, the Gordon-Kosswig-Anders cross shows a genetic incompatibility between two fish species in the Xiphophorus genus. Modified from Meierjohann and Schartl [63]. F1 hybrids back-crossed to their X. helleri parent yield a 3:1 ratio of viable to inviable offspring, where the inviables result from co-segregation of the functional xmrk gene and a short STR allele in the cdkn2a/b promoter. Shading indicates melanism conferred by xmrk. B, genetic background is epistatic to effects of ELF3 STR variation in A. thaliana. Expression-matched transgenic plants with various alleles of the ELF3 STR in the Columbia (Col-0) and Wassilewskija (Ws) backgrounds, showing endogenous, exogenous, and synthetic ("0") alleles in each background [25]. White boxes indicate transgenic plants carrying the ELF3 STR endogenous to their respective background; white arrowheads indicate early-flowering ELF3 STR genotypes (elf3 mutants and poorly functioning ELF3 STR alleles confer early flowering).
Figure 2. Effective reduction in STR coverage in whole-genome sequencing. Expected coverage of STRs for various sequencing depths and read lengths. We assumed 8 bp of flanking sequence on either side (as required by the LobSTR software [38]). Black bars indicate nominal sequencing coverage for each scenario. 4-5X coverage (left panel) is typical for genomes in the human 1000 Genomes Project [95]; 15-20X coverage is typical for genomes in the A. thaliana 1001 Genomes Project [97,98].
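
The reduction described in this legend can be approximated with a short Python sketch. It is not necessarily the calculation used to produce the figure; it assumes single-end reads whose start positions are uniformly distributed along the genome, and it counts a read as informative only if it contains the entire STR plus the flanking sequence on both sides. The function name and the example STR lengths are hypothetical. With nominal coverage C and read length L, reads start at an average density of C/L per base, and a read can span a region of s + 2f bases from L - (s + 2f) + 1 start positions.

```python
def spanning_coverage(nominal_coverage, read_length, str_length, flank=8):
    """Expected number of reads that fully span an STR plus `flank` bp on each side.

    Assumes uniformly distributed single-end reads; `flank` defaults to the
    8 bp stated in the figure legend.
    """
    required = str_length + 2 * flank
    # Number of read start positions from which the read covers the whole region.
    usable_starts = max(0, read_length - required + 1)
    return nominal_coverage * usable_starts / read_length


if __name__ == "__main__":
    # Illustrative depths and read lengths; the STR lengths are arbitrary examples.
    for coverage in (5, 20):
        for read_length in (36, 100, 250):
            for str_length in (20, 60):
                effective = spanning_coverage(coverage, read_length, str_length)
                print(f"{coverage}X, {read_length} bp reads, {str_length} bp STR: "
                      f"{effective:.2f}X spanning coverage")
```

Under these assumptions the effective coverage is always below the nominal coverage, and it falls to zero once the STR plus its flanks is longer than the read.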